Common Pitfalls Using the Normalized Compression Distance: What to Watch out for in a Compressor

Authors

  • MANUEL CEBRIÁN
  • MANUEL ALFONSECA
  • ALFONSO ORTEGA
Abstract

Using the mathematical background for algorithmic complexity developed by Kolmogorov in the sixties, Cilibrasi and Vitanyi have designed a similarity distance, the normalized compression distance, applicable to the clustering of objects of any kind, such as music, texts or gene sequences. The normalized compression distance is a quasi-universal normalized admissible distance under certain conditions. This paper shows that the compressors used to compute the normalized compression distance are not idempotent in some cases: their behavior is strongly skewed by the size of the objects and by the window size, which causes a deviation in the identity property of the distance unless care is taken that the objects to be compressed fit within the windows. The relationship between the precision of the distance and the size of the objects has been analyzed for several well-known compressors, and in particular depth for three of them, bzip2, gzip and PPMZ, which are examples of the three main types of compressors: block-sorting, Lempel-Ziv, and statistical.
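The identity deviation described in the abstract can be probed with a short experiment. The following is a minimal sketch, not the authors' experimental setup: it assumes the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it uses Python's `zlib` (DEFLATE, the algorithm behind gzip) and `bz2` modules as stand-ins for the compressors studied. PPMZ is not covered. The helper names `ncd`, `random_bytes` and `identity_deviation` are illustrative only.

```python
import bz2
import random
import zlib


def ncd(x: bytes, y: bytes, compress) -> float:
    """Normalized compression distance under the given compressor function."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_bytes(n: int, seed: int = 0) -> bytes:
    """Reproducible, essentially incompressible test data (an assumption of this sketch)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def identity_deviation(sizes):
    """Return (size, NCD(x, x) with zlib, NCD(x, x) with bz2) for each size."""
    rows = []
    for size in sizes:
        x = random_bytes(size)
        rows.append((size, ncd(x, x, zlib.compress), ncd(x, x, bz2.compress)))
    return rows


if __name__ == "__main__":
    # An ideal (idempotent) compressor would give NCD(x, x) close to 0 at every size.
    for size, d_zlib, d_bz2 in identity_deviation((1_000, 10_000, 100_000)):
        print(f"{size:>8} bytes  zlib: {d_zlib:.3f}  bz2: {d_bz2:.3f}")
```

With `zlib`, whose DEFLATE window is 32 KiB, NCD(x, x) should stay near 0 while x is well below the window size and should approach 1 once x no longer fits, because the second copy can no longer be matched against the first. `bz2.compress` defaults to 900 kB blocks, so a comparable jump would be expected only for much larger inputs than the sizes used here.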


Similar resources

Normalized Distance Matrix Method for Construction of Phylogenetic Trees Using New Compressor - Dnabit Compress

We define a compression distance based on a normal compressor and show that it is an admissible distance. The first theme concerns the statistical significance of compressed file sizes. Only in recent years have scientists begun to appreciate the fact that compression ratios carry a great deal of important statistical information. In applying the approach, we have used a new DNA sequence compresso...


Development of a compression system dynamic simulation code for testing and designing of anti-surge control system

In recent years, several research activities have been conducted to develop knowledge in the analysis, design and optimization of compressor anti-surge control systems. Since anti-surge control testing on a full-scale compressor is limited by the possible consequences of failure, and an experimental facility for setting up control strategies and logic can be expensive, the design process often involves...


Investigation of energy consumption reduction in multistage compression process and its solutions

During hot seasons the inlet temperature of nitrogen increases; as a result, the compressor consumes more power to compress a specific mass ratio of fluid, and consequently the total energy consumption of the compressor increases as well. In this research, a three-stage centrifugal compressor with intercooler was modeled thermodynamically in order to decrease the energy consumption of the compressor...


Operation Analysis of Rotary Tools of Compressor Station Using Exergy Approach

In this study, the operation of a compressor station has been investigated using an exergy approach. Exergy analysis is a thermodynamic method which shows the irreversibility of a system quantitatively. Gas compressors are used to compensate for the significant pressure drop along the gas pipeline. The compression process causes a temperature rise in the gas; in this regard a gas cooler is applied to reduce the temp...


Clustering the Normalized Compression Distance for Virus Data

The present paper analyzes the usefulness of the normalized compression distance for the problem of clustering the HA sequences of virus data for the HA gene, depending on the available compressors. Using the CompLearn Toolkit, the built-in compressors zlib and bzip are compared. Moreover, a comparison is made with respect to hierarchical and spectral clustering. For the hierarchical clustering...



Publication date: 2005